Anderson–Darling Test (Goodness-of-Fit / Normality)#

The Anderson–Darling (AD) test is a goodness-of-fit test: it asks whether a sample plausibly comes from a target distribution. It is similar in spirit to Kolmogorov–Smirnov, but it puts much more weight on the tails, which is often exactly what you care about in practice.


Learning goals#

By the end you should be able to:

  • state the null/alternative hypotheses for the one-sample AD test

  • explain why AD is tail-sensitive

  • compute the AD statistic from first principles (NumPy-only)

  • interpret the result via p-values / critical values

  • diagnose where the mismatch happens using CDF + contribution plots

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) What the AD test is used for#

The question it answers#

“If the true data-generating distribution were F, how surprising is it to see a sample whose empirical CDF looks like this?”

AD is most commonly used as a normality test (where F is a normal CDF with mean/variance estimated from the data), but it is more general: you can test against any continuous CDF you can evaluate.

Typical applications#

  • checking normality of residuals (regression, control charts, signal processing)

  • validating a fitted distribution before using it for probabilities/quantiles

  • comparing tail behavior (outliers, risk, reliability)

When you should be careful#

  • AD is derived for continuous distributions; ties/discrete data need extra care.

  • Observations should be roughly i.i.d. (dependence can inflate false positives).

  • The null distribution of the statistic depends on whether distribution parameters are known or estimated.
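The last caveat is easy to see by simulation. The sketch below is standalone, so it re-implements the statistic inline (the helper name `a2` is ours, and it uses the stdlib `erf` rather than the approximation developed later in this page); it compares the Monte Carlo null distribution of A² when μ and σ are fixed in advance versus estimated from each sample:

```python
from math import erf

import numpy as np


def a2(x, mu, sigma, eps=1e-12):
    """One-sample AD statistic against N(mu, sigma), computed inline."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    i = np.arange(1, n + 1)
    z = (xs - mu) / (sigma * np.sqrt(2.0))
    Fi = np.clip(0.5 * (1.0 + np.vectorize(erf)(z)), eps, 1.0 - eps)
    return -n - np.sum((2 * i - 1) * (np.log(Fi) + np.log1p(-Fi[::-1]))) / n


rng = np.random.default_rng(0)
n, n_sim = 50, 2000
known = np.empty(n_sim)
estimated = np.empty(n_sim)
for j in range(n_sim):
    sim = rng.standard_normal(n)
    known[j] = a2(sim, 0.0, 1.0)                    # parameters fixed a priori
    estimated[j] = a2(sim, sim.mean(), sim.std())   # parameters fit to the sample

# Fitting mu/sigma to each sample makes the fit look better, so the
# null statistics are systematically smaller -- and need smaller critical values.
print(f"median A² (known params):     {np.median(known):.3f}")
print(f"median A² (estimated params): {np.median(estimated):.3f}")
```

This is why the same observed statistic can be significant in one setting and unremarkable in the other, and why `ad_null_stats_normality` later in this page simulates the estimated-parameter case specifically.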

2) Hypotheses and what “reject” means#

For the one-sample AD test:

  • H0: observations are i.i.d. from the target distribution with CDF F(x)

  • H1: the data are not from F(x)

AD produces a statistic A² ≥ 0.

  • larger A² ⇒ larger deviation from F (especially in the tails)

  • you reject H0 when A² is “too large” (via a p-value or critical value)

Important: failing to reject does not prove the distribution is correct; it just means you didn’t find strong evidence against it.

3) Intuition: why AD is tail-sensitive#

Let:

  • F(x) be the hypothesized CDF

  • F_n(x) be the empirical CDF from your sample

One way to write the AD statistic is:

\[ A^2 = n \int_{-\infty}^{\infty} \frac{\big(F_n(x) - F(x)\big)^2}{F(x)\,(1-F(x))} \, dF(x) \]

The key is the weight term:

\[ w(F) = \frac{1}{F(1-F)} \]

When F is near 0 or near 1 (the tails), w(F) becomes very large, so tail mismatches are amplified.

F_grid = np.linspace(1e-3, 1 - 1e-3, 2000)
w = 1.0 / (F_grid * (1.0 - F_grid))

fig = go.Figure()
fig.add_trace(go.Scatter(x=F_grid, y=w, mode="lines", name="w(F)=1/(F(1-F))"))
fig.update_layout(
    title="AD tail weighting: w(F) blows up near 0 and 1",
    xaxis_title="F",
    yaxis_title="weight w(F)",
    yaxis_type="log",
)
fig.show()

4) The discrete formula (how you actually compute A²)#

Sort the data:

\[ x_{(1)} \le x_{(2)} \le \dots \le x_{(n)} \]

Compute CDF values under the null:

\[ F_i = F\big(x_{(i)}\big) \]

Then the one-sample AD statistic is:

\[ A^2 = -n - \frac{1}{n} \sum_{i=1}^n (2i-1)\Big(\ln(F_i) + \ln\big(1 - F_{n+1-i}\big)\Big) \]

Notes:

  • You must avoid log(0) by clipping F_i away from 0 and 1.

  • For the common normality version (unknown mean/variance), people often use a small-sample correction:

\[ A^{2\,*} = A^2 \left(1 + \frac{0.75}{n} + \frac{2.25}{n^2}\right) \]

Below we implement everything with NumPy only, including a normal CDF.

def _as_1d_finite(x):
    x = np.asarray(x, dtype=float).ravel()
    x = x[np.isfinite(x)]
    return x


def erf_approx(x):
    """Vectorized erf approximation (Abramowitz & Stegun 7.1.26)."""
    x = np.asarray(x, dtype=float)
    sign = np.sign(x)
    ax = np.abs(x)

    p = 0.3275911
    a1 = 0.254829592
    a2 = -0.284496736
    a3 = 1.421413741
    a4 = -1.453152027
    a5 = 1.061405429

    t = 1.0 / (1.0 + p * ax)
    poly = (((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t)
    y = 1.0 - poly * np.exp(-(ax ** 2))
    return sign * y


def norm_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF using erf approximation (NumPy-only)."""
    x = np.asarray(x, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    if np.any(sigma <= 0):
        raise ValueError("sigma must be positive.")
    z = (x - mu) / (sigma * np.sqrt(2.0))
    return 0.5 * (1.0 + erf_approx(z))


def ecdf(x):
    """Empirical CDF points (x_sorted, p)."""
    x = _as_1d_finite(x)
    xs = np.sort(x)
    n = xs.size
    p = np.arange(1, n + 1) / n
    return xs, p


def anderson_darling_statistic(x, cdf, args=(), eps=1e-12):
    """One-sample Anderson–Darling statistic A² for a continuous CDF F."""
    x = _as_1d_finite(x)
    n = x.size
    if n < 2:
        raise ValueError("Need at least 2 finite observations.")

    xs = np.sort(x)
    i = np.arange(1, n + 1)

    Fi = np.asarray(cdf(xs, *args), dtype=float)
    Fi = np.clip(Fi, eps, 1.0 - eps)

    s = np.sum((2 * i - 1) * (np.log(Fi) + np.log1p(-Fi[::-1])))
    return -n - s / n


def anderson_darling_contributions(x, cdf, args=(), eps=1e-12):
    """Per-order-statistic contribution to A² (excluding the '-n' constant)."""
    x = _as_1d_finite(x)
    n = x.size
    xs = np.sort(x)
    i = np.arange(1, n + 1)

    Fi = np.asarray(cdf(xs, *args), dtype=float)
    Fi = np.clip(Fi, eps, 1.0 - eps)

    term = (2 * i - 1) * (np.log(Fi) + np.log1p(-Fi[::-1]))
    contrib = -term / n
    return xs, contrib


def ad_normality(x):
    """
    AD normality test pieces (NumPy-only).

    Estimates mu,sigma from the sample and applies the common correction:
    A²* = A² * (1 + 0.75/n + 2.25/n²)
    """
    x = _as_1d_finite(x)
    n = x.size
    if n < 2:
        raise ValueError("Need at least 2 finite observations.")

    mu = x.mean()
    sigma = x.std(ddof=0)
    if sigma <= 0:
        raise ValueError("sigma is 0; AD normality test is undefined.")

    A2 = anderson_darling_statistic(x, norm_cdf, args=(mu, sigma))
    corr = 1.0 + 0.75 / n + 2.25 / (n**2)
    A2_star = A2 * corr
    return {"n": n, "mu": mu, "sigma": sigma, "A2": A2, "A2_star": A2_star, "corr": corr}


def ad_pvalue_normal_approx(A2_star):
    """
    Approximate p-value for the AD normality test (using A²*).

    This is a widely used approximation (Stephens-style). For maximum accuracy,
    especially in small samples, prefer a Monte Carlo p-value.
    """
    A2_star = float(A2_star)

    if A2_star < 0.2:
        p = 1.0 - np.exp(-13.436 + 101.14 * A2_star - 223.73 * (A2_star**2))
    elif A2_star < 0.34:
        p = 1.0 - np.exp(-8.318 + 42.796 * A2_star - 59.938 * (A2_star**2))
    elif A2_star < 0.6:
        p = np.exp(0.9177 - 4.279 * A2_star - 1.38 * (A2_star**2))
    else:
        p = np.exp(1.2937 - 5.709 * A2_star + 0.0186 * (A2_star**2))

    return float(np.clip(p, 0.0, 1.0))


def ad_null_stats_normality(n, n_sim=3000, seed=0):
    """Monte Carlo null distribution for A²* under normality (mu,sigma estimated)."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_sim, dtype=float)

    for j in range(n_sim):
        sim = rng.standard_normal(n)
        stats[j] = ad_normality(sim)["A2_star"]

    return stats

5) Worked example: data that are actually normal#

We generate a sample from a normal distribution. Under H0, we expect:

  • A²* to look typical for this sample size

  • a p-value that is not extremely small

n = 80
x_norm = rng.normal(loc=0.0, scale=1.0, size=n)

res_norm = ad_normality(x_norm)
p_approx_norm = ad_pvalue_normal_approx(res_norm["A2_star"])

null_stats = ad_null_stats_normality(n, n_sim=4000, seed=123)
p_mc_norm = float(np.mean(null_stats >= res_norm["A2_star"]))

alphas = np.array([0.15, 0.10, 0.05, 0.025, 0.01])
crit = np.quantile(null_stats, 1 - alphas)

print("Normal sample")
print({k: (float(v) if isinstance(v, (np.floating, float)) else v) for k, v in res_norm.items() if k != "corr"})
print(f"Approx p-value (normality): {p_approx_norm:.4f}")
print(f"MC p-value (normality):     {p_mc_norm:.4f}")
print()
print("Monte Carlo critical values for A²* (reject if A²* > critical):")
for a, c in zip(alphas, crit):
    print(f"  alpha={a:>5.3f}  critical={c:.4f}")
Normal sample
{'n': 80, 'mu': 0.025721174006128944, 'sigma': 0.7695278508490445, 'A2': 0.40370944560065425, 'A2_star': 0.40763615075512927}
Approx p-value (normality): 0.3479
MC p-value (normality):     0.3688

Monte Carlo critical values for A²* (reject if A²* > critical):
  alpha=0.150  critical=0.5714
  alpha=0.100  critical=0.6443
  alpha=0.050  critical=0.7715
  alpha=0.025  critical=0.8856
  alpha=0.010  critical=1.0426
# P–P plot: theoretical CDF values vs empirical probabilities
xs = np.sort(x_norm)
i = np.arange(1, n + 1)
emp_p = (i - 0.5) / n
theory_p = norm_cdf(xs, res_norm["mu"], res_norm["sigma"])

fig = go.Figure()
fig.add_trace(go.Scatter(x=theory_p, y=emp_p, mode="markers", name="data"))
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode="lines", name="y = x"))
fig.update_layout(
    title="P–P plot for normality (normal sample)",
    xaxis_title="Theoretical CDF F(xᵢ)",
    yaxis_title="Empirical probability (i−0.5)/n",
)
fig.show()
# Contribution view: where the statistic comes from
_, contrib = anderson_darling_contributions(x_norm, norm_cdf, args=(res_norm["mu"], res_norm["sigma"]))
contrib_star = contrib * res_norm["corr"]

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=np.arange(1, n + 1),
        y=contrib_star,
        mode="lines+markers",
        name="contribution",
    )
)
fig.update_layout(
    title="Where A²* comes from (normal sample)",
    xaxis_title="Order statistic index i (1=smallest, n=largest)",
    yaxis_title="Contribution to A²*",
)
fig.show()
fig = px.histogram(
    null_stats,
    nbins=60,
    title=f"Null distribution of A²* under normality (n={n}, MC={len(null_stats)})",
    labels={"value": "A²*"},
)
fig.add_vline(x=res_norm["A2_star"], line_color="red", line_width=3)
fig.add_annotation(
    x=res_norm["A2_star"],
    y=0,
    yanchor="bottom",
    text=f"observed A²*={res_norm['A2_star']:.3f}<br>MC p≈{p_mc_norm:.3f}",
    showarrow=True,
    arrowhead=2,
)
fig.update_layout(showlegend=False)
fig.show()

6) Worked example: heavy tails (t distribution)#

Now we generate data with heavier tails than a normal distribution.

Because AD heavily weights the tails, it often detects this kind of deviation strongly.

x_t = rng.standard_t(df=3, size=n)

res_t = ad_normality(x_t)
p_approx_t = ad_pvalue_normal_approx(res_t["A2_star"])
p_mc_t = float(np.mean(null_stats >= res_t["A2_star"]))

print("t(3) sample (heavy tails)")
print({k: (float(v) if isinstance(v, (np.floating, float)) else v) for k, v in res_t.items() if k != "corr"})
print(f"Approx p-value (normality): {p_approx_t:.4f}")
print(f"MC p-value (normality):     {p_mc_t:.4f}")
t(3) sample (heavy tails)
{'n': 80, 'mu': 0.015710624639228243, 'sigma': 1.6263187423964756, 'A2': 2.4402203779633993, 'A2_star': 2.4639553339834337}
Approx p-value (normality): 0.0000
MC p-value (normality):     0.0000
# Empirical CDF vs fitted normal CDF (visual goodness-of-fit)
xs_emp, p_emp = ecdf(x_t)
p_fit = norm_cdf(xs_emp, res_t["mu"], res_t["sigma"])

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=xs_emp,
        y=p_emp,
        mode="lines",
        line_shape="hv",
        name="Empirical CDF",
    )
)
fig.add_trace(go.Scatter(x=xs_emp, y=p_fit, mode="lines", name="Fitted normal CDF"))
fig.update_layout(
    title="Empirical CDF vs fitted normal CDF (t(3) sample)",
    xaxis_title="x",
    yaxis_title="CDF",
)
fig.show()
# Compare contribution profiles: normal vs heavy-tailed
_, contrib_n = anderson_darling_contributions(x_norm, norm_cdf, args=(res_norm["mu"], res_norm["sigma"]))
_, contrib_t = anderson_darling_contributions(x_t, norm_cdf, args=(res_t["mu"], res_t["sigma"]))

contrib_n_star = contrib_n * res_norm["corr"]
contrib_t_star = contrib_t * res_t["corr"]

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=np.arange(1, n + 1),
        y=contrib_n_star,
        mode="lines+markers",
        name="Normal sample",
    )
)
fig.add_trace(
    go.Scatter(
        x=np.arange(1, n + 1),
        y=contrib_t_star,
        mode="lines+markers",
        name="t(3) sample",
    )
)
fig.update_layout(
    title="AD is tail-sensitive: contributions often spike in the extremes",
    xaxis_title="Order statistic index i (tails are near 1 and n)",
    yaxis_title="Contribution to A²*",
)
fig.show()

7) How to interpret the result (what it means)#

The mechanics#

  1. Pick a null distribution F (and decide whether its parameters are fixed or estimated).

  2. Compute the AD statistic (for normality, we used A²*).

  3. Convert A² / A²* into a decision:

    • critical value: reject if A²* > c_α

    • p-value: reject if p < α

The meaning#

  • If you reject: the sample is unlikely to come from the target distribution (given the test’s assumptions). This does not tell you the exact alternative; use the plots to see whether the mismatch is in the center, skew, or tails.

  • If you fail to reject: you don’t have strong evidence against the target distribution. It is not a proof of normality.

A practical rule of thumb#

With large n, even tiny deviations can be statistically significant. Use AD as a diagnostic alongside effect-size thinking and domain context.
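A minimal sketch of that effect, assuming a 95/5 normal scale mixture as the “tiny” deviation. The helper `a2_star` (our name) re-implements the corrected statistic inline with the stdlib `erf` so the cell runs standalone, and ~0.752 is a commonly quoted 5% critical value for the estimated-parameter case:

```python
from math import erf

import numpy as np


def a2_star(x, eps=1e-12):
    """AD normality statistic with the 1 + 0.75/n + 2.25/n² correction."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    z = (xs - xs.mean()) / (xs.std() * np.sqrt(2.0))
    Fi = np.clip(0.5 * (1.0 + np.vectorize(erf)(z)), eps, 1.0 - eps)
    i = np.arange(1, n + 1)
    A2 = -n - np.sum((2 * i - 1) * (np.log(Fi) + np.log1p(-Fi[::-1]))) / n
    return A2 * (1.0 + 0.75 / n + 2.25 / n**2)


rng = np.random.default_rng(1)
for n in (50, 500, 5000):
    # 95% N(0,1) + 5% N(0,3): visually close to normal, but not normal.
    mix = np.where(rng.random(n) < 0.05,
                   rng.normal(0.0, 3.0, size=n),
                   rng.normal(0.0, 1.0, size=n))
    # ~0.752 is a commonly quoted 5% critical value for this case.
    print(f"n={n:>5}  A²* = {a2_star(mix):.3f}")
```

At small n the mixture often passes; as n grows, A²* climbs well past any reasonable critical value even though the deviation itself has not changed.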

8) Common pitfalls and diagnostics#

  • Parameter estimation matters: the null distribution changes if you estimate parameters from the data.

  • Outliers: AD will often flag a single extreme point (sometimes that’s desired; sometimes it’s a data-quality problem).

  • Dependence (time series, spatial data): AD assumes i.i.d.; autocorrelation can create false rejections.

  • Discrete / rounded data: ties violate the continuous-data derivation; consider simulation/permutation approaches.

  • Multiple testing: if you test many groups/features, adjust your significance threshold.
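For the multiple-testing point, a minimal Bonferroni-style guard might look like this (the p-values here are hypothetical placeholders standing in for per-feature AD results):

```python
import numpy as np

# Hypothetical per-feature AD p-values (placeholders, not computed here).
pvals = np.array([0.001, 0.04, 0.20, 0.03, 0.60])
alpha = 0.05
m = pvals.size

naive = pvals < alpha            # 3 features look non-normal on their own...
bonferroni = pvals < alpha / m   # ...but only 1 survives the adjusted threshold

print(naive.sum(), bonferroni.sum())  # → 3 1
```

Bonferroni is conservative; with many tests, a false-discovery-rate procedure is a common alternative, but the point stands: per-test α must shrink as the number of tests grows.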

Diagnostics that pair well with AD:

  • CDF / P–P plots (global mismatch)

  • contribution plots (where in the distribution the statistic is coming from)

  • domain-specific residual checks (e.g., heteroskedasticity, seasonality)

9) Exercises#

  1. Implement a CDF for another distribution (e.g., exponential with rate λ) and run anderson_darling_statistic against it.

  2. Use ad_null_stats_normality to compute your own critical values for different sample sizes (n=20, n=200) and compare.

  3. Build two alternatives (skewed vs heavy-tailed) and compare how the contribution plots differ.
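One possible starting point for exercise 1, assuming a known rate λ (the fully specified case, so no small-sample correction applies). `expon_cdf` and `a2` are our own standalone versions, not the functions defined earlier:

```python
import numpy as np


def expon_cdf(x, lam=1.0):
    """CDF of Exp(rate=lam): F(x) = 1 - exp(-lam·x) for x >= 0, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, -np.expm1(-lam * x), 0.0)


def a2(x, cdf, args=(), eps=1e-12):
    """One-sample AD statistic for a fully specified continuous CDF."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    i = np.arange(1, n + 1)
    Fi = np.clip(np.asarray(cdf(xs, *args), dtype=float), eps, 1.0 - eps)
    return -n - np.sum((2 * i - 1) * (np.log(Fi) + np.log1p(-Fi[::-1]))) / n


rng = np.random.default_rng(7)
x_exp = rng.exponential(scale=1.0, size=200)   # true rate = 1

print(f"A² vs Exp(rate=1): {a2(x_exp, expon_cdf, args=(1.0,)):.3f}")   # modest
print(f"A² vs Exp(rate=5): {a2(x_exp, expon_cdf, args=(5.0,)):.3f}")   # large
```

Note that `-np.expm1(-lam * x)` is a numerically safer way to write `1 - np.exp(-lam * x)` for small arguments.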

References#

  • Anderson, T. W., & Darling, D. A. (1952). Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes.

  • Stephens, M. A. (1974/1976). Approximations and corrections for EDF statistics (commonly used for AD p-values).

  • SciPy: scipy.stats.anderson (returns the statistic and critical values for several distributions).